Master's Programme in Computational Linguistics
Course: Introduction to Computational Linguistics

Lecture 10
Question-answering and how to evaluate an NLP system

Lecturer: Dan Cristea
Seminars & project: Mihaela Onofrei, Dan Cristea

Question-answering systems

The discourse layer
[Diagram: the processing layers between the initial text and the result: sub-syntactic, syntactic, semantic, discourse and pragmatic processing]

Question-answering
[Diagram: the same processing layers; the discourse layer covers discourse structure, cohesion & coherence, summarisation, temporal processing and question-answering]

Question Answering – Definition
• Question Answering (QA): a QA system takes as input a question in natural language and produces one or more ranked answers from a collection of documents
– By providing a small set of exact answers to questions, QA takes a step closer to information retrieval rather than to document retrieval

Information Retrieval (IR) vs Document Retrieval (DR)
• IR: finding information, usually matching some user-defined fixed structure, in a collection of unstructured nature (usually texts)
• DR: finding a set of relevant documents, satisfying some user-defined criteria, in a collection of unstructured nature (usually texts)

Modules
• In a classical approach, QA systems normally adhere to a pipeline architecture composed of three main modules:
– question analysis: the results are keywords, answer and question type, focus
– paragraph retrieval: the results are a set of relevant candidate paragraphs/sentences from the document collection
– answer extraction: the results are a set of candidate answers ranked using likelihood measures

Question Type
– Factoid: “Who discovered the oxygen?”, “When did Hawaii become a state?” or “What football team won the World Cup in 1992?”
– List: “What countries export oil?” or “What are the regions preferred by the Americans for holidays?”
– Definition: “What is a quasar?” or “What is a
question-answering system?”
– Other: how-questions, why-questions, hypothetical questions, polar (Yes/No) questions, cross-lingual questions
Harabagiu, S. & Moldovan, D. (2003). Question Answering. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics.

Answer Type
• Person: “What”, “Who”, “Whom”, “With who”
• Location (City, Country and Region): “What state/city”, “From where”, “Where”
• Organization: “Who produced”, “Who made”
• Temporal (Date and Year): “When”
• Measure (Length, Surface and Other): “How much”
• Count: “How many”
• Yes/No: “Did the girl fear that?”, “Is the car blue?”

The search collection
• Closed-domain
– looks for answers in local collections, internal organization documents, newspapers, etc.
– deals with questions from a specific domain (medical, baseball, etc.)
– can exploit domain-specific knowledge (ontologies, rules, disambiguation)
• Open-domain
– looks for answers on the internet
– general questions about anything
– can use general knowledge about the world

History of QA systems
• The first systems, created in the 60s
– BASEBALL (Green et al., 1963): answers questions about baseball games
• Systems of the 70s
– LUNAR (Woods & Kaplan, 1977): geological analysis of rocks returned by the Apollo moon missions
• Systems of the 80s
– IURES (Cristea et al., 1985): medical domain, querying national programs library ← semantic network driven
– QUERNAL (Cristea, 1987): personnel database, drilling and extraction, metallurgy, geography ← rule-based (syntagma)

QA systems of today
• Powerset: http://www.powerset.com/ (http://www.bing.com/)
• Assimov, the chatbot: http://talkingrobot.org/
• IBM Watson: https://www.ibm.com/watson
• Djingo, the multi-service virtual assistant from Orange (controlled by voice or text): https://www.orange.com/en/Human-Inside/Shaping-tomorrow-s-world/SH2017/Djingo-your-multi-service-virtual-assistant
– allows the user to navigate Orange TV, manage the connected home, make a call or access other services
• Amazon Alexa…
• Google play devices…

QA Competitions
• TREC (Text REtrieval Conference), started in 1992: http://trec.nist.gov/
– National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA
• CLEF (Cross Language Evaluation Forum), started in 2000: http://www.clef-campaign.org/
– European languages in both monolingual and cross-language contexts
– Coordination: Istituto di Scienza e Tecnologie dell'Informazione, Pisa, Italy

CLEF 2011 – Input Data
[Figure: an excerpt from a ‘gold standard’ file]

UAIC System – CLEF 2011
• Our group has participated in the CLEF exercises since 2006:
– 2006: Ro–En (English collection), 9.47% right answers
– 2007: Ro–Ro (Romanian Wikipedia), 12%
– 2008: Ro–Ro (Romanian Wikipedia), 31%
– 2009: Ro–Ro, En–En (JRC-Acquis), 47.2% (48.6%)
– 2010: Ro–Ro, En–En, Fr–Fr (JRC-Acquis, Europarl), 47.5% (42.5%, 27%)
[Chart: percentage of right answers per year, 2006 to 2010; vertical axis from 0 to 50]

UAIC System components
[Diagram of the system components; recoverable box labels:]
• Background knowledge
• Lucene index 1
• Test data (documents, questions, possible answers)
• Questions processing: lemmatization, stop words elimination, NEs identification, Lucene query
• Answers processing: lemmatization, stop words elimination, NEs identification, Lucene query
• Identify relevant documents
• Lucene indexes 2
• Partial and global scores per answer

Background knowledge indexing
• The Romanian background knowledge has 161,279 documents in text format
– 25,033 correspond to the AIDS topic
– 51,130 to the Climate Change topic
– 85,116 to the Music and Society topic
• The indexing component processes the name of the file and the text of the article => Lucene index 1

Test data processing – Processing questions
• Stop words elimination
• Lemmatization
• Named Entity identification
• Lucene query building

Test data processing – Processing possible answers
• Ontologies are used (Iftene and Balahur, 2008), for instance, to locate cities (relation [is located in])
– In which European cities has Annie Lennox performed?
– Answers containing non-European cities are eliminated (here: replaced with the value XXXXX)

Searching using relevant documents for questions
• Then, in every index, the queries associated to the possible answers are run using Lucene
• For every answer, a list of documents with their Lucene relevance scores is obtained
– Score2(d, a) is the relevance score for document d when searched with the Lucene query associated to answer a

References
• BASEBALL: Green, B. F. Jr. et al. (1963). Baseball: An Automatic Question Answerer. In Edward A. Feigenbaum and Julian Feldman (eds.), Computers and Thought, New York, McGraw-Hill Book Company, pp. 207-216
• LUNAR: Woods, W. A., Kaplan, R. (1977). Lunar rocks in natural English: Explorations in natural language question answering. In Linguistic structures processing
• IURES: Cristea, D., Tufiș, D., Mihăescu, T. (1985). IURES: A Computer Natural Language Question-Answering System with Possible Medical Applications. In Rev. Med. Chir. Soc. Med. Nat. Iași, LXXXIX(3), pp. 511-516
• QUERNAL: Cristea, D. (1987). Sistemul QUERNAL. In Giumale, C., Preoţescu, D., Şerbănaţi, L. D., Tufiş, D., Cristea, D.: LISP, pp. 215-229, Editura Tehnică, Bucureşti
• QA-UAIC@CLEF: Iftene, A. & Balahur, A. (2008). Answer validation on English and Romanian languages. In Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 448-451

How to evaluate an NLP system?
NLP Systems need evaluation
• “An important recent development in NLP has been the use of much more rigorous standards for the evaluation of NLP systems” (Manning and Schütze)
• To be published, all research must:
– establish a baseline, and
– quantitatively show that it improves on the baseline and the state of the art

NLP Systems – ways of doing evaluation
• Answer the question: “How well does the system work?”
• Possible domains for evaluation
– Processing time of the system
– Space usage of the system
– Human satisfaction
– Correctness of results
• Measures: (Accuracy, Error), (Precision, Recall, F-measure)

Accuracy and Error
• To evaluate means to compare the system output against a gold standard
• The results of a system are marked as:
– Correct: matches the gold standard
– Incorrect: otherwise

Accuracy and Error – Example
• A system that detects the language
• Accuracy = 66.66%
• Error = 33.33%

Precision and Recall
• Precision and Recall are set-based measures
• They evaluate the quality of some set membership, based on a reference set membership
• Precision: what proportion of the retrieved documents is relevant?
• Recall: what proportion of the relevant documents is retrieved?
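
A minimal Python sketch of the measures defined above. The function names and the toy data are illustrative assumptions, not part of the course material: accuracy and error compare system labels with gold labels item by item, while precision and recall compare a retrieved set with a relevant set.

```python
def accuracy(predicted, gold):
    """Share of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot evaluate against an empty gold standard")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def precision_recall(retrieved, relevant):
    """Set-based precision and recall of `retrieved` against `relevant`."""
    retrieved, relevant = set(retrieved), set(relevant)
    common = retrieved & relevant
    # By convention (an assumption here), an empty set yields a score of 0.
    precision = len(common) / len(retrieved) if retrieved else 0.0
    recall = len(common) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical language-detection output: 2 of 3 labels match the gold standard.
acc = accuracy(["ro", "en", "fr"], ["ro", "en", "de"])
err = 1 - acc
print(f"accuracy={acc:.4f} error={err:.4f}")

# Hypothetical retrieval run: 10 retrieved documents, 14 relevant, 4 in common.
p, r = precision_recall(range(10), range(6, 20))
print(f"precision={p:.4f} recall={r:.4f}")
```

Error is computed as the complement of accuracy, since every result is marked either correct or incorrect.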
Precision and Recall – Example (1)
[Figure: relevant documents vs retrieved documents]
Precision = 4 / 10 = 40%
Recall = 4 / 14 = 28.57%

Precision and Recall – Example (2)
[Figure: relevant documents vs retrieved documents]
Precision = 14 / 20 = 70%
Recall = 14 / 14 = 100%

Precision and Recall – Example (3)
[Figure: relevant documents vs retrieved documents]
Precision = 6 / 6 = 100%
Recall = 6 / 20 = 30%

Precision and Recall – Example (4)
[Figure: relevant documents vs retrieved documents]
Precision = 0 / 6 = 0%
Recall = 0 / 14 = 0%

F-measure (F-score or F1-score)
• F-measure is a measure of a test's accuracy, and it considers both the precision p and the recall r
• General formula: F_β = (1 + β²) · p · r / (β² · p + r)
• F1-measure: F_1 = 2 · p · r / (p + r)

Recommended environment of a module participating in a processing chain
[Diagram: Module X reads an input and writes an output, each conforming to a standard; it is configured through a parameters file (txt) and uses a resource conforming to a resource standard]

If they observe the same standards, modules can be chained
[Diagram: Module X (parameters X, resource X) and Module Y (parameters Y, resource Y) are linked through the shared standard XY; the output of Module Y conforms to standard YZ; each resource conforms to its own resource standard]

How to calibrate a module?
• Suppose we want to build a module to achieve a particular goal. Then, in fact, we will have to build 3 modules:
– The Training Module (TM)
– The module itself (X)
– The Evaluation Module (EM)

Training module (TM)
• TM learns a model from a training corpus; the model will subsequently be used by module X
[Diagram: preferencesTraining.pref and the Training Corpus feed the Training Module, which produces the model]

Module X
• X applies an algorithm to an input and transforms it according to the trained model
[Diagram: preferencesX.pref, the model and input.xml feed the module X, which produces output.xml]

Evaluation module (EM)
• EM evaluates (compares) a Test file against a Gold file
[Diagram: preferencesEvaluation.pref, the Test file (output.xml) and gold.xml feed the Evaluation Module, which produces evalLog]

Evaluation measures
• Precision = #common items in Test & Gold / #items in Test
• Recall = #common items in Test & Gold / #items in Gold
• F-measure = 2 * P * R / (P + R)

Assembling an evaluation system
[Diagram: TM (preferencesTraining.pref, Training corpus) produces the model; X (preferencesX.pref, model, input.xml) produces output.xml; EM (preferencesEvaluation.pref, output.xml, gold.xml) produces evalLog]

Calibrate = iterate parameters until an optimum is reached
[Diagram: the same TM, X, EM chain, driven by configuration.cfg; the iteration ends with the optimal values]

10-fold evaluation
• Gold corpus, divided into 10 parts
• Train on 9 parts, evaluate on the 10th part => E_i
• Iterate 10 times => E_1, E_2, … E_10
• E = (Σ E_i) / 10
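
The F-measure and the 10-fold evaluation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `f_measure`, `k_fold_splits`, `cross_validate`, the majority-label "training module" and the toy corpus are hypothetical names and data, not the modules used in the course. The `beta` parameter follows the usual weighted F-measure; `beta=1` gives 2 * P * R / (P + R).

```python
from collections import Counter


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs: each of the k parts is the test set once."""
    if k < 2 or len(items) < k:
        raise ValueError("need k >= 2 and at least k items")
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, train_fn, eval_fn, k=10):
    """Train on k-1 parts, evaluate on the remaining one, average the k scores."""
    scores = [eval_fn(train_fn(train), test)
              for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for TM and EM: the "model" is the majority label,
# and the evaluation score is the accuracy of that label on the test part.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]


def eval_accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)


corpus = [(i, "a") for i in range(20)]  # toy gold corpus, all items labelled "a"
E = cross_validate(corpus, train_majority, eval_accuracy, k=10)
print("F1 for P=0.4, R=4/14:", round(f_measure(0.4, 4 / 14), 4))
print("10-fold score E:", E)
```

Passing the training and evaluation steps as functions mirrors the separation between TM and EM: calibration can then call `cross_validate` once per candidate parameter setting and keep the setting with the best E.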